feat: add retrieval event logging and evaluation harness #5
- Add RETRIEVAL_EVENT_LOGGING_ENABLED (default false) for detailed
search/fetch telemetry in separate retrieval_query_events collection
(90-day TTL via Firestore, linked to parent tool_call events)
- New retrievalEvaluation.ts implementing:
  - FirestoreRetrievalEvalCaseStore + replace/list helpers
  - buildSyntheticEvalCases + correlateImplicitFetches (search+fetch
    pairs within time window become positive labels)
  - computeRetrievalEvalMetrics (hit@k, recall@k, MRR, zero-result rate,
    p50/p95 latency)
- New functions/scripts/retrieval-eval.ts + package.json scripts:
  - eval:generate (plant isolated corpus + cases in memory_vectors_eval)
  - eval:import (turn recent production retrieval events into cases)
  - eval:run (--mode isolated|production, --cutoffs, reports JSON metrics)
- Wire new config (retrievalEventLoggingEnabled, topK, memoryCollection)
through app.ts into createMetaCortexMcpServer and observer
- mcpServer: extend execute wrapper to accept optional retrieval
descriptors; search_context and fetch_context now emit rich events
when enabled (with scores converted from cosine distance)
- Test updates: InMemoryToolCallObserver.listRetrievalEvents, integration
assertions, config boolean parser tests, fakes
- Smoke test: --fetch-first / MCP_FETCH_FIRST to drive fetch after search
- Infra: firestore indexes for memory_vectors_eval, rules deny for new
collections (retrieval_query_events, retrieval_eval_cases, *_eval*),
deploy-firestore-ttl.sh updated
- Docs + .env.example updated
All tests green (49 passing).
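For readers unfamiliar with the metrics listed above, here is a minimal sketch of how hit@k, MRR, zero-result rate and latency percentiles are commonly computed. It is not the code in computeRetrievalEvalMetrics: the EvalResult shape, the helper names and the nearest-rank percentile method are all assumptions made for illustration.

```typescript
// Hypothetical shape of one evaluated case; field names are illustrative only.
interface EvalResult {
  rankedIds: string[];      // ids returned by search, best first
  positiveIds: Set<string>; // ids labeled relevant for this query
  latencyMs: number;
}

// hit@k: 1 if any positive id appears in the top k results, else 0.
function hitAtK(r: EvalResult, k: number): number {
  return r.rankedIds.slice(0, k).some((id) => r.positiveIds.has(id)) ? 1 : 0;
}

// Reciprocal rank of the first positive id (0 when none is returned).
function reciprocalRank(r: EvalResult): number {
  const idx = r.rankedIds.findIndex((id) => r.positiveIds.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}

// Nearest-rank percentile (an assumed method); p is in [0, 100].
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

function mean(xs: number[]): number {
  return xs.length === 0 ? 0 : xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Made-up sample data: one hit at rank 2, one miss, one zero-result query.
const results: EvalResult[] = [
  { rankedIds: ["a", "b", "c"], positiveIds: new Set(["b"]), latencyMs: 40 },
  { rankedIds: ["d", "e"], positiveIds: new Set(["z"]), latencyMs: 120 },
  { rankedIds: [], positiveIds: new Set(["q"]), latencyMs: 15 },
];

const summary = {
  hitAt1: mean(results.map((r) => hitAtK(r, 1))),
  hitAt3: mean(results.map((r) => hitAtK(r, 3))),
  mrr: mean(results.map((r) => reciprocalRank(r))),
  zeroResultRate: mean(results.map((r) => (r.rankedIds.length === 0 ? 1 : 0))),
  p50LatencyMs: percentile(results.map((r) => r.latencyMs), 50),
  p95LatencyMs: percentile(results.map((r) => r.latencyMs), 95),
};
```

With only three samples, p95 collapses to the maximum latency, which is one reason percentile figures from a small event set should be read with caution.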
@claude review and how long do you think i'll need to run this in prod before we start looking at the analytics?
Claude finished @chrishonson's task in 2m 26s
Code Review + Analytics Timeline
Review
Overall this is clean, well-structured work. The architecture (opt-in flag, separate collection, dedicated eval partition, implicit-fetch correlation for free labels) is solid. A few things worth looking at:
Correctness / Subtle Bugs
When // at the top of the reduce callback for recall
if (positiveIds.size === 0) return total; // ← today this silently penalizes; instead skip denominator too
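The suggestion in the snippet above, skipping unlabeled cases in the denominator as well as the numerator, can be sketched as follows. This is an illustration under assumed names (LabeledCase, meanRecallAtK), not the actual reduce callback in the PR.

```typescript
// Hypothetical case shape; the real module's types may differ.
interface LabeledCase {
  rankedIds: string[];
  positiveIds: Set<string>;
}

// Mean recall@k over cases that have at least one positive label.
// Cases with no positives are skipped entirely, so they contribute
// neither to the running sum nor to the denominator.
function meanRecallAtK(cases: LabeledCase[], k: number): number | null {
  let sum = 0;
  let counted = 0;
  for (const c of cases) {
    if (c.positiveIds.size === 0) continue; // skip both numerator and denominator
    const topK = new Set(c.rankedIds.slice(0, k));
    let found = 0;
    c.positiveIds.forEach((id) => {
      if (topK.has(id)) found++;
    });
    sum += found / c.positiveIds.size;
    counted++;
  }
  // null signals "no labeled cases", which is different from a recall of 0.
  return counted === 0 ? null : sum / counted;
}

// Made-up sample: the middle case has no labels and must not drag the mean down.
const sample: LabeledCase[] = [
  { rankedIds: ["a", "b", "c"], positiveIds: new Set(["a", "x"]) }, // recall@2 = 0.5
  { rankedIds: ["d"], positiveIds: new Set<string>() },             // skipped
  { rankedIds: ["e", "f"], positiveIds: new Set(["f"]) },           // recall@2 = 1
];

const recall = meanRecallAtK(sample, 2); // 0.75; dividing by all 3 cases would give 0.5
```

Returning null when nothing is labeled is a design choice made for this sketch; a caller could equally omit the metric from the report.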
The Fetch error path emits no On a 404 from
Minor / Nits
Home-rolled
Both collections end up with the same 90-day TTL, but the constant's name implies it belongs to
Probably intentional — eval cases persist across runs by design. But the TTL script covers three of the four new collections and is silent on why
Node will print the stack trace on rejection, which is fine. But wrapping it in
What's Good
How Long to Run in Production Before Looking at Analytics
tl;dr: ~1 week for a first signal, 3–4 weeks for a stable baseline you can regression-test against. The
A few things that affect how fast you get there:
Summary
Opt-in retrieval observability + a full offline + production retrieval evaluation harness.
What
- With RETRIEVAL_EVENT_LOGGING_ENABLED=true, search_context and fetch_context now write rich events (full query, filters, ranked result ids + derived scores, fetch ids) to a dedicated retrieval_query_events collection (90 d TTL, deny-all rules, linked to the regular tool events).
- retrievalEvaluation module + CLI:
  - eval:generate – creates a deterministic synthetic corpus in the isolated memory_vectors_eval collection and the corresponding golden retrieval_eval_cases.
  - eval:import – converts real production search→fetch observations (via correlateImplicitFetches) into labeled cases.
  - eval:run – executes the cases (either directly against the eval partition or against a live MCP endpoint) and computes standard metrics (hit@k, recall@k, MRR, zero-result rate, p50/p95 latency).
Motivation
Gives us repeatable, automatable measurement of retrieval quality using either controlled fixtures or actual usage signals (successful fetches after searches are treated as positives).
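To make the "fetch after search counts as a positive" idea concrete, here is a rough sketch of time-window correlation. It is not the PR's correlateImplicitFetches: the event shapes, the sessionId field, and the rule that a fetch only counts when the id appeared in the search results are all assumptions for illustration.

```typescript
// Hypothetical event shapes; the stored retrieval events may look different.
interface SearchEvent {
  sessionId: string;
  query: string;
  resultIds: string[];
  atMs: number;
}
interface FetchEvent {
  sessionId: string;
  documentId: string;
  atMs: number;
}
interface ImplicitCase {
  query: string;
  positiveIds: string[];
}

// Pair each search with fetches from the same session that happen within
// windowMs after it and target one of the ids the search returned.
function correlateFetches(
  searches: SearchEvent[],
  fetches: FetchEvent[],
  windowMs: number,
): ImplicitCase[] {
  const cases: ImplicitCase[] = [];
  for (const s of searches) {
    const positives = new Set<string>();
    for (const f of fetches) {
      const delta = f.atMs - s.atMs;
      const inWindow = delta >= 0 && delta <= windowMs;
      const wasReturned = s.resultIds.indexOf(f.documentId) !== -1;
      if (f.sessionId === s.sessionId && inWindow && wasReturned) {
        positives.add(f.documentId);
      }
    }
    // Searches with no follow-up fetch produce no case at all.
    if (positives.size > 0) {
      cases.push({ query: s.query, positiveIds: Array.from(positives) });
    }
  }
  return cases;
}

// Made-up events: only the first fetch is in-session, in-window and in the results.
const implicitCases = correlateFetches(
  [{ sessionId: "s1", query: "deploy steps", resultIds: ["m1", "m2"], atMs: 1_000 }],
  [
    { sessionId: "s1", documentId: "m2", atMs: 1_500 },   // counted
    { sessionId: "s1", documentId: "m1", atMs: 999_999 }, // outside the window
    { sessionId: "s2", documentId: "m2", atMs: 1_200 },   // different session
  ],
  60_000,
);
```

A narrower window gives fewer but cleaner labels; a wider one picks up more fetches that may be unrelated to the preceding search.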
New scripts
npm --prefix functions run eval:generate
npm --prefix functions run eval:import -- --lookback-hours 168
npm --prefix functions run eval:run -- --mode isolated # or production
Verification
- npm --prefix functions test → 49/49 passing
- --fetch-first helper
See the "Retrieval evaluation" section added to the README for usage.